Abstract

We consider the problem of testing distribution identity. Given a sequence of independent samples
from an unknown distribution on a domain of size $n$, the goal is to check whether the unknown
distribution approximately equals a known distribution on the same domain. While Batu, Fortnow,
Fischer, Kumar, Rubinfeld, and White (FOCS 2001) proved that the sample complexity of the problem is
$\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$, the running time of their tester is much
higher: $O(n) + \tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$. We modify their tester to
achieve a running time of $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$.

Let $p$ and $q$ be two probability distributions on $[n]$,¹ and let $\|p - q\|_1$ denote the
$\ell_1$-distance between $p$ and $q$. In this paper, algorithms have access to two distributions
$p$ and $q$.

• The distribution $p$ is known: for each $i \in [n]$, the algorithm can query the probability $p_i$
of $i$ in constant time.

• The distribution $q$ is unknown: the algorithm can only obtain an independent sample from $q$ in
constant time.

An identity tester is an algorithm such that: 

• if $p = q$, then it accepts with probability at least 2/3,

• if $\|p - q\|_1 > \varepsilon$, then it rejects with probability at least 2/3.

Batu, Fortnow, Fischer, Kumar, Rubinfeld, and White [BFF+01] proved that there is an identity tester
that uses only $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$ samples from $q$. A
shortcoming of their algorithm is a running time of
$O(n) + \tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$. In this note, we show that their
tester can be modified to achieve a running time of
$\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$. It is also well known that
$\Omega(\sqrt{n})$ samples are required to tell the uniform distribution on $[n]$ from a distribution
that is uniform on a random subset of $[n]$ of size $n/2$.
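As a small illustration of the hard instance mentioned above, the sketch below computes the $\ell_1$-distance between the uniform distribution and a distribution that is uniform on half of the domain. The function names, the 0-indexed domain, and the fixed (rather than random) subset are assumptions made only for this example.

```python
def l1_distance(p, q):
    """l1-distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def uniform(n):
    """Uniform distribution on a domain of size n."""
    return [1.0 / n] * n


def uniform_on_subset(n, subset):
    """Distribution on a domain of size n that is uniform on `subset`."""
    mass = 1.0 / len(subset)
    return [mass if i in subset else 0.0 for i in range(n)]


n = 8
p = uniform(n)
# A fixed subset of size n/2 stands in for the random subset.
q = uniform_on_subset(n, {0, 2, 5, 7})
print(l1_distance(p, q))  # 1.0
```

The two distributions are at the maximum possible distance for any choice of the half-size subset, even though a small sample from either one looks similar.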

1 The Original Tester 

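We now describe the tester of Batu et al. [BFF+01], which is outlined as Algorithm 1. Let
$\varepsilon' = \varepsilon/C$, where $C$ is a sufficiently large positive constant. The tester
starts by partitioning the set $[n]$ into
$k + 1 = \lceil \log_{1+\varepsilon'}(n/\varepsilon') \rceil + 1 = \Theta(\frac{1}{\varepsilon} \cdot \log(n/\varepsilon))$
sets $R_0, R_1, \ldots, R_k$ in Step 1, where

$$R_j = \left\{ i \in [n] : \frac{\varepsilon'}{n} \cdot (1+\varepsilon')^{j-1} < p_i \le \frac{\varepsilon'}{n} \cdot (1+\varepsilon')^{j} \right\}$$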

¹ We write $[k]$ to denote the set $\{1, 2, \ldots, k\}$, for any positive integer $k$.
Algorithm 1: Outline of the tester of Batu et al. [BFF+01]

1 Partition $[n]$ into $R_0, R_1, \ldots, R_k$
2 Compute $P_j$, for $j \in \{0, 1, \ldots, k\}$
3 Use $O((k/\varepsilon)^2 \cdot \log k)$ samples from $q$ to get an estimate $Q'_j$ of each $Q_j$ up to $\varepsilon/(4k+4)$
4 if $\|(P_0, \ldots, P_k) - (Q'_0, \ldots, Q'_k)\|_1 > \varepsilon/4$ then REJECT
5 Let $S_i$, $i \in [n]$, be the number of occurrences of $i$ in a sample of size $S = \tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$
6 for $j > 0$ s.t. $P_j > \varepsilon/(4k+4)$ do
7     if $\sum_{i \in R_j} \binom{S_i}{2} > (1 + \varepsilon/4) \cdot \binom{S}{2} \cdot P_j^2 \cdot \frac{1}{|R_j|}$ then REJECT
8 ACCEPT



for $j > 0$, and

$$R_0 = \left\{ i \in [n] : p_i \le \frac{\varepsilon'}{n} \right\}.$$
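The partition can be evaluated one element at a time from a single query to $p_i$. The following is a minimal sketch, assuming the bucket boundaries $\frac{\varepsilon'}{n}(1+\varepsilon')^j$ described above; the function names and the parameter name `eps1` (standing for $\varepsilon'$) are hypothetical.

```python
import math


def largest_bucket_index(n, eps1):
    """k = ceil(log_{1+eps1}(n / eps1)), so that R_k can hold probability 1."""
    return math.ceil(math.log(n / eps1) / math.log(1.0 + eps1))


def bucket_index(p_i, n, eps1):
    """Index j such that an element of probability p_i belongs to R_j.

    Assumes R_0 holds elements with p_i <= eps1 / n, and R_j (j > 0) holds
    elements with (eps1/n)(1+eps1)^(j-1) < p_i <= (eps1/n)(1+eps1)^j.
    """
    if p_i <= eps1 / n:
        return 0
    j = math.ceil(math.log(p_i * n / eps1) / math.log(1.0 + eps1))
    # Guard against floating-point rounding just above the R_0 threshold.
    return max(1, j)
```

For instance, with `n = 100` and `eps1 = 0.5`, an element of probability 0.01 falls in the set with boundaries 0.0075 and 0.01125, which is index 2. Values that sit exactly on a boundary may be misplaced by floating-point rounding, so an exact implementation would compare against the boundaries directly.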

We then define probabilities of each set according to $p$ and $q$: $P_j = \sum_{i \in R_j} p_i$ and
$Q_j = \sum_{i \in R_j} q_i$. The tester computes and estimates those probabilities in Steps 2 and 3.
In Step 4, the tester verifies that the probabilities of sets $R_j$ in both distributions are close.
Finally, in Steps 5-7, the tester verifies that $q$ restricted to each $R_j$ is approximately uniform,
by comparing second moments of $p$ and $q$ over each $R_j$. If $q$ passes the test with probability
greater than 1/3, it must be close to $p$. On the other hand, if $p = q$, then the parameters can be
set so that $q$ passes with probability 2/3.
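The second-moment comparison in Steps 5-7 relies on counting pairs of equal samples inside a set $R_j$. The sketch below shows only that collision count, under the assumption that the statistic is $\sum_{i \in R_j} \binom{S_i}{2}$; the function name is hypothetical and the rejection threshold is deliberately left out.

```python
from collections import Counter
from math import comb


def collisions_in_bucket(samples, bucket):
    """Number of unordered pairs of equal samples whose value lies in `bucket`.

    `samples` is a list of drawn elements and `bucket` is a set of elements
    standing in for one R_j. With S_i occurrences of element i, the result is
    the sum of C(S_i, 2) over all i in the bucket.
    """
    counts = Counter(samples)
    return sum(comb(c, 2) for i, c in counts.items() if i in bucket)
```

A distribution that is far from uniform on a set tends to produce more collisions there than a uniform one with the same total weight, which is why this count serves as a uniformity check.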

Note that the additive linear term in the complexity of the tester comes from explicitly computing
each $R_j$ and each $P_j$ in Steps 1-2.

2 Our Improvement 

Note that the partition of $[n]$ into sets $R_j$ need not be computed explicitly, since for each
sample $i$ from $q$, one can check which $R_j$ it belongs to by querying $p_i$.

We observe that one can verify that $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1$ is small without
explicitly computing each $P_j$. We use Algorithm 2 for this purpose. Let $j^*$ be an index such that
an element of probability $1/\sqrt{n}$ would belong to $R_{j^*}$. The algorithm is based on the
following facts:

• For $j < j^*$, if $P_j$ is not negligible, $R_j$ must be large, and a good additive estimate to
$P_j$ can be obtained by uniformly sampling $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$
elements of $[n]$, and computing the weight of those that belong to $R_j$.

• If $p = q$, we are likely to learn all elements in $R_j$, $j \ge j^*$, by drawing only
$\tilde{O}(\sqrt{n})$ samples from $q$. This gives the exact value of each $P_j$, $j \ge j^*$. If
$p \ne q$, this method still gives lower bounds for each $P_j$.
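To see why uniform sampling yields an estimate of $P_j$ in the first fact above, the following short calculation may help. It assumes the estimator $\hat{P}_j = \frac{n}{m} \sum_{t=1}^{m} p_{X_t} \cdot \mathbf{1}[X_t \in R_j]$, where $X_1, \ldots, X_m$ are independent uniform samples from $[n]$; the symbols $\hat{P}_j$, $m$, and $X_t$ are notation introduced here for illustration only.

```latex
\mathbb{E}\big[\hat{P}_j\big]
  = \frac{n}{m} \sum_{t=1}^{m} \mathbb{E}\big[\, p_{X_t} \cdot \mathbf{1}[X_t \in R_j] \,\big]
  = \frac{n}{m} \cdot m \cdot \sum_{i \in R_j} \frac{1}{n} \cdot p_i
  = \sum_{i \in R_j} p_i
  = P_j .
```

For $j < j^*$ every summand $p_{X_t}$ is below $1/\sqrt{n}$, which is what allows a sample of size roughly $\sqrt{n}$ times a polynomial in $1/\varepsilon$ to concentrate around this expectation.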

If $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 > \delta$, our estimates for $P_j$ and $Q_j$ are
likely to be sufficiently different. A detailed proof follows.

Lemma 1 Algorithm 2 with appropriately chosen constants tells $p = q$ (Case 1) from
$\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 > \delta$ (Case 2) with probability 9/10.

Proof. The multiplicative constant in the sample size of Step 1 is such that Step 1 succeeds with
probability 99/100. The size of $S_1$ is chosen such that with probability 99/100, $S_1$ contains all
elements $i$ of probability
Algorithm 2: Telling $p = q$ (Case 1) from $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 > \delta$ (Case 2)

1 Use $O((k/\delta)^2 \cdot \log k)$ samples from $q$ to get an estimate $Q'_j$ of each $Q_j$ up to $\delta/(8k+8)$
2 Let $j^*$ be an index such that an element of probability $1/\sqrt{n}$ would belong to $R_{j^*}$
3 Let $S_1$ be a set of $O(\sqrt{n} \cdot \log n)$ samples from $q$
4 for $j$ s.t. $j^* \le j \le k$ do
5     Let $T_j = S_1 \cap R_j$ (with no repetitions)
6     if $Q'_j - \sum_{i \in T_j} p_i > \frac{\delta}{8k+8}$ then return "Case 2"
7 Let $S_2$ be a set of $O(\mathrm{poly}(k/\delta) \cdot \sqrt{n} \cdot \log k)$ independent uniform samples from $[n]$ with replacement
8 for $j$ s.t. $j < j^*$ do
9     Let $U_j = S_2 \cap R_j$ (with repetitions)
10     if $\left| Q'_j - \frac{n}{|S_2|} \sum_{i \in U_j} p_i \right| > \frac{\delta}{4k+4}$ then return "Case 2"
11 return "Case 1"



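$q_i \ge 1/((1+\varepsilon')\sqrt{n})$, by the coupon collector's problem; when $p = q$, this covers
every element of $R_j$, $j \ge j^*$. Finally, the size of $S_2$ is chosen such that with probability
99/100, for each $j < j^*$,

$$\left| \frac{n}{|S_2|} \sum_{i \in U_j} p_i - P_j \right| < \frac{\delta}{8k+8}.$$

Let us first focus on $j < j^*$ such that $P_j > \frac{\delta}{8k+8}$. Note that each $i \in S_2$
contributes a value in $[0, 1/\sqrt{n}]$ to $\sum_{i \in U_j} p_i$. By the Chernoff bound,
$O(\mathrm{poly}(k/\delta) \cdot \sqrt{n} \cdot \log k)$ samples suffice to estimate $P_j$ with
multiplicative error $1 + \frac{\delta}{8k+8}$, with a failure probability small enough for a union
bound over all $j < j^*$, which implies additive error at most $\frac{\delta}{8k+8}$ as well. For
$j < j^*$ such that $P_j \le \frac{\delta}{8k+8}$, the Chernoff bound still guarantees with the same
probability that the estimate is small enough to stay within the same additive error.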

If $p = q$, then Algorithm 2 discovers this with probability 97/100 due to the following facts.
Firstly, $T_j = R_j$ for $j \ge j^*$, so $\sum_{i \in T_j} p_i = P_j$. Therefore, provided all $Q'_j$
are good approximations to the corresponding $Q_j$, $q$ always passes Step 6. Secondly, if all
$\frac{n}{|S_2|} \sum_{i \in U_j} p_i$, $0 \le j < j^*$, are good approximations of the corresponding
$P_j$, $q$ always passes Step 10 as well.

If $\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 > \delta$, there is $j'$ such that
$Q_{j'} > P_{j'} + \frac{\delta}{2k+2}$. If $j' \ge j^*$, then because $P_{j'}$ is always greater than
or equal to $\sum_{i \in T_{j'}} p_i$, the tester concludes in Step 6 for $j = j'$ that Case 2 occurs,
provided $Q'_{j'}$ is a good approximation to $Q_{j'}$, which happens with probability at least
99/100. If $j' < j^*$, then because we have good approximations to both $Q_{j'}$ and $P_{j'}$ with
probability 98/100, and the distance between these approximations is greater than
$\frac{\delta}{2k+2} - 2 \cdot \frac{\delta}{8k+8} = \frac{\delta}{4k+4}$, the algorithm concludes in
Step 10 for $j = j'$ that Case 2 occurs. ■
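The existence of the index $j'$ used in the second half of the proof can be checked with a short averaging argument. The sketch below assumes the gap $\frac{\delta}{2k+2}$, which is the value that averaging over the $k+1$ sets gives; the set $J^+$ is notation introduced here for illustration.

```latex
\sum_{j=0}^{k} (Q_j - P_j) = 1 - 1 = 0
\quad\Longrightarrow\quad
\sum_{j \in J^+} (Q_j - P_j)
  = \tfrac{1}{2}\, \|(P_0,\ldots,P_k) - (Q_0,\ldots,Q_k)\|_1
  > \tfrac{\delta}{2},
\qquad J^+ = \{\, j : Q_j > P_j \,\},
\\[4pt]
\Longrightarrow\quad
\exists\, j' \in J^+ :\;
Q_{j'} - P_{j'} \;>\; \frac{\delta}{2\,|J^+|} \;\ge\; \frac{\delta}{2(k+1)} \;=\; \frac{\delta}{2k+2}.
```

The first implication uses the fact that the positive and negative parts of a zero-sum vector each carry half of its $\ell_1$-norm, and the second uses that the largest of at most $k+1$ positive terms is at least their average.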

To get an efficient tester, we replace Steps 2-4 of Algorithm 1 with Algorithm 2, where we set
$\delta$ to $\varepsilon/C$ for a sufficiently large constant $C$. If Algorithm 2 concludes that
Case 2 occurs, the new algorithm immediately rejects. Furthermore, if it is not the case that
$\|(P_0, \ldots, P_k) - (Q_0, \ldots, Q_k)\|_1 > \delta$, Steps 6 and 7 work with estimates $Q'_j$
instead of the exact values $P_j$ up to a modification of constants.
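The procedure that distinguishes Case 1 from Case 2 can be sketched in code as follows. This is an illustrative reading and not a definitive implementation: the bucket boundaries, the inclusion of $j^*$ in the first loop, the one-sided test for heavy sets, and the estimator $\frac{n}{|S_2|} \sum_{i \in U_j} p_i$ are assumptions, and the sample sizes `m0`, `m1`, `m2` are plain parameters rather than the asymptotic bounds. All function and parameter names are hypothetical, and the domain is 0-indexed.

```python
import math
import random
from collections import Counter


def bucket_index(p_i, n, eps1):
    """Index j of the set R_j for an element of probability p_i (assumed boundaries)."""
    if p_i <= eps1 / n:
        return 0
    return max(1, math.ceil(math.log(p_i * n / eps1) / math.log(1.0 + eps1)))


def distinguish_cases(p, sample_q, delta, eps1, m0, m1, m2, rng):
    """Return "Case 1" or "Case 2" using only queries to p and samples from q."""
    n = len(p)
    k = bucket_index(1.0, n, eps1)                       # largest set index
    j_star = bucket_index(1.0 / math.sqrt(n), n, eps1)   # set of a 1/sqrt(n) element

    def bucket(i):
        return bucket_index(p[i], n, eps1)

    # Step 1: empirical estimates Q'_j of the set weights under q.
    counts = Counter(bucket(sample_q()) for _ in range(m0))
    q_est = [counts[j] / m0 for j in range(k + 1)]

    # Steps 3-6: distinct samples from q give lower bounds on P_j for heavy sets.
    seen = {sample_q() for _ in range(m1)}
    lower = [0.0] * (k + 1)
    for i in seen:
        lower[bucket(i)] += p[i]
    for j in range(j_star, k + 1):
        if q_est[j] - lower[j] > delta / (8 * k + 8):
            return "Case 2"

    # Steps 7-10: uniform samples from the domain estimate P_j for light sets.
    weight = [0.0] * (k + 1)
    for _ in range(m2):
        i = rng.randrange(n)
        weight[bucket(i)] += p[i]
    for j in range(j_star):
        if abs(q_est[j] - n * weight[j] / m2) > delta / (4 * k + 4):
            return "Case 2"
    return "Case 1"
```

Passing `p` as a list is only a convenience for the example; the sketch evaluates `p[i]` solely for sampled elements, so it never scans the whole domain. A call such as `distinguish_cases([1/16] * 16, lambda: rng.randrange(16), 0.5, 0.5, 200, 200, 200, rng)` with `rng = random.Random(0)` exercises the case $p = q$.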